Load the required libraries

library(tidyverse)
library(knitr)

When generating the report, the file does not have to be re-read every time: set cache=TRUE. To suppress informational messages and warnings, add message=FALSE and warning=FALSE.

df <- read_csv("../1-data/train_data.csv")
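
The chunk options mentioned above can be written in each chunk header or set once for the whole document. A minimal sketch, assuming the report is knitted with knitr and that document-wide defaults are wanted (the report itself only names the options, not where they are set):

```r
library(knitr)

# Assumption: these defaults apply to every chunk that follows;
# an individual chunk can still override them in its own header.
opts_chunk$set(
  cache   = TRUE,   # reuse cached chunk results instead of re-running them
  message = FALSE,  # hide informational messages (e.g. from read_csv)
  warning = FALSE   # hide warnings
)
```

A cached chunk is re-run when its code changes, but not when the CSV file on disk changes, so the cache may need to be cleared after the data is updated.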

Dimensions of the data file:

dim(df)
## [1] 1000000      17

Overview of the variables

(for nicer printing, we use the kable() function and split the variables across several rows)

summary(df[1:6])
##        id                y       amount_current_loan     term          
##  Min.   :      1   Min.   :0.0   Min.   : 10802      Length:1000000    
##  1st Qu.: 250001   1st Qu.:0.0   1st Qu.:174394      Class :character  
##  Median : 500001   Median :0.5   Median :269676      Mode  :character  
##  Mean   : 500001   Mean   :0.5   Mean   :316659                        
##  3rd Qu.: 750000   3rd Qu.:1.0   3rd Qu.:435160                        
##  Max.   :1000000   Max.   :1.0   Max.   :789250                        
##  credit_score       loan_purpose      
##  Length:1000000     Length:1000000    
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
summary(df[7:13]) %>%
  kable()
| yearly_income | home_ownership | bankruptcies | years_current_job | monthly_debt | years_credit_history | months_since_last_delinquent |
|---|---|---|---|---|---|---|
| Min. : 76627 | Length:1000000 | Min. :0.0000 | Min. : 0.00 | Min. : 0 | Min. : 4.0 | Min. : 0.0 |
| 1st Qu.: 825797 | Class :character | 1st Qu.:0.0000 | 1st Qu.: 3.00 | 1st Qu.: 10324 | 1st Qu.:13.0 | 1st Qu.: 16.0 |
| Median : 1148550 | Mode :character | Median :0.0000 | Median : 6.00 | Median : 16319 | Median :17.0 | Median : 32.0 |
| Mean : 1344805 | NA | Mean :0.1192 | Mean : 5.88 | Mean : 18550 | Mean :18.1 | Mean : 34.9 |
| 3rd Qu.: 1605899 | NA | 3rd Qu.:0.0000 | 3rd Qu.:10.00 | 3rd Qu.: 24059 | 3rd Qu.:22.0 | 3rd Qu.: 51.0 |
| Max. :165557393 | NA | Max. :7.0000 | Max. :10.00 | Max. :435843 | Max. :70.0 | Max. :176.0 |
| NA's :219439 | NA | NA's :1805 | NA's :45949 | NA | NA | NA's :529539 |

In the final report we can leave out the R code by using the echo=FALSE parameter.

| open_accounts | credit_problems | credit_balance | max_open_credit |
|---|---|---|---|
| Min. : 0.00 | Min. : 0.0000 | Min. : 0 | Min. :0.000e+00 |
| 1st Qu.: 8.00 | 1st Qu.: 0.0000 | 1st Qu.: 113392 | 1st Qu.:2.700e+05 |
| Median :10.00 | Median : 0.0000 | Median : 210539 | Median :4.600e+05 |
| Mean :11.18 | Mean : 0.1762 | Mean : 293847 | Mean :7.367e+05 |
| 3rd Qu.:14.00 | 3rd Qu.: 0.0000 | 3rd Qu.: 367422 | 3rd Qu.:7.674e+05 |
| Max. :76.00 | Max. :15.0000 | Max. :32878968 | Max. :1.540e+09 |
| NA | NA | NA | NA's :27 |

TO DO

Review the NA values and the distribution of y, and examine the character-type variables in more detail.
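
One possible way to approach these checks is sketched below. It runs on a small made-up data frame named toy (a stand-in, not the real df); the column names follow the report, but the values are invented purely for illustration.

```r
library(dplyr)

# Toy data standing in for df; values are made up for illustration.
toy <- tibble(
  y             = c(0, 1, 1, 0, 1),
  yearly_income = c(50000, NA, 72000, NA, 61000),
  term          = c("short", "long", "short", NA, "long")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Distribution of the target variable, with the share of each class
y_dist <- toy %>%
  count(y) %>%
  mutate(share = n / sum(n))
y_dist

# Distinct values of every character column
toy %>%
  select(where(is.character)) %>%
  lapply(unique)
```

On the full data set the same calls would be applied to df instead of toy.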

df$loan_purpose <- as.factor(df$loan_purpose)
df$y <- as.factor(df$y)
summary(df$loan_purpose) %>%
  kable()
| | x |
|---|---|
| business_loan | 17756 |
| buy_a_car | 11855 |
| buy_house | 6897 |
| debt_consolidation | 785428 |
| educational_expenses | 992 |
| home_improvements | 57517 |
| major_purchase | 3727 |
| medical_bills | 11521 |
| moving | 1548 |
| other | 91481 |
| renewable_energy | 109 |
| small_business | 3242 |
| take_a_trip | 5632 |
| vacation | 1166 |
| wedding | 1129 |

Or:

df %>%
  group_by(loan_purpose) %>%
  summarise(n = n())  %>%
  arrange(desc(n)) %>%
  kable()
| loan_purpose | n |
|---|---|
| debt_consolidation | 785428 |
| other | 91481 |
| home_improvements | 57517 |
| business_loan | 17756 |
| buy_a_car | 11855 |
| medical_bills | 11521 |
| buy_house | 6897 |
| take_a_trip | 5632 |
| major_purchase | 3727 |
| small_business | 3242 |
| moving | 1548 |
| vacation | 1166 |
| wedding | 1129 |
| educational_expenses | 992 |
| renewable_energy | 109 |
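
The group_by() / summarise() / arrange() pipeline can likely be shortened with dplyr's count(), which groups, tallies and optionally sorts in one call. A small sketch on made-up data (toy is a hypothetical stand-in for df) comparing the two forms:

```r
library(dplyr)

# Toy data standing in for df; values are made up for illustration.
toy <- tibble(
  loan_purpose = c("other", "moving", "other", "wedding", "other", "moving")
)

# Long form, as used in the report
long_form <- toy %>%
  group_by(loan_purpose) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# Shorter form: count() with sort = TRUE orders by descending n
short_form <- toy %>%
  count(loan_purpose, sort = TRUE)

long_form
short_form
```

Both forms give the same counts in the same order here; which one to use is mostly a matter of readability.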

Once you have chosen the variables, visualize them

df %>%
  group_by(y, loan_purpose) %>%
  summarise(n = n()) %>%
  ggplot(aes(fill=y, y=n, x=loan_purpose)) + 
  geom_bar(position="dodge", stat="identity") + 
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  theme_dark()

Most bankruptcies (y == 1) occur when the loan is taken for these purposes:

df %>%
  filter(y == 1) %>%
  group_by(loan_purpose) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  kable()
| loan_purpose | n |
|---|---|
| debt_consolidation | 391875 |
| other | 44888 |
| home_improvements | 27274 |
| business_loan | 10356 |
| medical_bills | 6286 |
| buy_a_car | 5810 |
| buy_house | 3652 |
| take_a_trip | 2870 |
| small_business | 2152 |
| major_purchase | 2120 |
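
Raw counts largely mirror how common each purpose is (debt_consolidation alone accounts for 785428 of the 1000000 rows), so the share of y == 1 within each purpose may be a more informative ranking. A minimal sketch on made-up data; toy and the column names n_y1 and share_y1 are hypothetical and not part of the original analysis:

```r
library(dplyr)

# Toy data standing in for df; values are made up for illustration.
# y is numeric here; the comparison y == 1 also works when y is a
# factor with levels "0" and "1", as in the report.
toy <- tibble(
  loan_purpose = c("a", "a", "a", "a", "b", "b"),
  y            = c(1, 0, 0, 0, 1, 1)
)

rates <- toy %>%
  group_by(loan_purpose) %>%
  summarise(
    n        = n(),           # all loans with this purpose
    n_y1     = sum(y == 1),   # loans with y == 1
    share_y1 = mean(y == 1)   # share of y == 1 within the purpose
  ) %>%
  arrange(desc(share_y1))

rates
```

Using the counts from the two tables in this report, business_loan has 10356 of 17756 loans with y == 1 (about 58%), while debt_consolidation has 391875 of 785428 (about 50%), so the two rankings can differ.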

Additional suggestions for improving interactivity

Interactive tables with datatable (DT)

library(DT)
df %>%
  group_by(y, loan_purpose) %>%
  summarise(n = n()) %>%
  datatable()

Interactive charts with plotly

library(plotly)
df %>%
  group_by(y, credit_score) %>%
  summarise(n = n()) %>%
  plot_ly(x = ~credit_score, y = ~n, name = ~y, type = "bar")